I am a Junior Data Analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships.
Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently. From the insights, our team will design a new marketing strategy to convert casual riders into annual members.
Financial Analyst Team have concluded that annual membership is much more profitable than single-ride and full-day passes from their analysis. So to make people opt for the yearly membership, our marketing campaign should urge the casual riders to convert to annual riders. As a solution, we should understand why casual riders would convert to a yearly membership? Based on the insights from the above question, we can achieve the maximum required conversion rate from casual to annual riders.
1. Where is the Data located?
The files are downloaded and stored into System and is further uploaded into RStudio Desktop for analysis.
2. How is Data organized?
Month-wise data segregation is done for the year 2021 (Jan 2021 - Dec 2021).
3. Are there any issues with bias or credibility in this data? Does your data ROCCC?
Yes, data ROCCC (Reliable, Original, Comprehensive, Current, and Cited). As the data has been collected directly from the company’s customers database, hence, there is no issues of bias and credibility for the same reason. It is ‘Reliable, Original, Comprehensive, Current, and Cited’,i.e. ROCCC.
4. How are you accessing licensing, privacy, security, and accessibility?
Data was collected by Motivate International Inc. under the following license https://www.divvybikes.com/data-license-agreement. Also the dataset does not contain any personal information about its customers (or riders) to violate the privacy.
5. How did you verify the data’s integrity?
To follow Data Integrity, data should be - Accurate, Complete, Consistent and Trustworthy. Data is complete as it contains all the required components to measure the entity. It is consistent across the years with every year having its CSV file which is organized in an equal number of columns and same data types. As the credibility was proven before, it is also trustworthy.
6. How does it help to answer your question?
We need to find out new trendz from the data and find out the relationship between annual and casual members on how they have been using the rides and on what basis. New feature are created to deduce such relationship.
7. Are there any problems with the data?
Yes, there were some issues with the data. It consisted of duplicate records which had to be removed and also consisted of ‘N/A’ values which were further removed.
1. What tools are you choosing and why?
Data which i am using is 2020 and it looks pretty huge. In such large datasets scenario, it is preferred to us a Programming Language. For such case, i am using R language to analyse this Case Study.
2. What steps have you taken to ensure that your data is clean?
3. How can you verify that your data is clean and ready to analyze?
To verify that data is clean and ready to analyze: a) Used filter() to check if there were any missing values. b) Used count() to check the unique values of each variable. c) Used duplicated() to check for any duplicate values.
4. Have you documented your cleaning process so you can review and share those results?
Yes, have documented the entire Analysis process from Cleaning to Observations.
Snippets are below.
Install and load necessary Packages:
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/mjaff/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mjaff\AppData\Local\Temp\RtmpuO7FwS\downloaded_packages
install.packages("janitor", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/mjaff/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'janitor' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mjaff\AppData\Local\Temp\RtmpuO7FwS\downloaded_packages
install.packages("scales", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/mjaff/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'scales' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\mjaff\AppData\Local\Temp\RtmpuO7FwS\downloaded_packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
df1 <- read_csv("202101-divvy-tripdata.csv")
## Rows: 96834 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df2 <- read_csv("202102-divvy-tripdata.csv")
## Rows: 49622 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df3 <- read_csv("202103-divvy-tripdata.csv")
## Rows: 228496 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df4 <- read_csv("202104-divvy-tripdata.csv")
## Rows: 337230 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df5 <- read_csv("202105-divvy-tripdata.csv")
## Rows: 531633 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df6 <- read_csv("202106-divvy-tripdata.csv")
## Rows: 729595 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df7 <- read_csv("202107-divvy-tripdata.csv")
## Rows: 822410 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df8 <- read_csv("202108-divvy-tripdata.csv")
## Rows: 804352 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df9 <- read_csv("202109-divvy-tripdata.csv")
## Rows: 756147 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df10 <- read_csv("202110-divvy-tripdata.csv")
## Rows: 631226 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df11 <- read_csv("202111-divvy-tripdata.csv")
## Rows: 359978 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
df12 <- read_csv("202112-divvy-tripdata.csv")
## Rows: 247540 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
cyclistic_df <- rbind(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12)
cyclistic_df_rem_empty <- remove_empty(cyclistic_df, which = c("rows","cols"))
count(cyclistic_df)
count(cyclistic_df_rem_empty)
count(filter(cyclistic_df_rem_empty, start_station_name==''),start_station_name, member_casual,sort=TRUE)
cyclistic_df %>%
count(rideable_type)
cyclistic_df_rem_empty <- na.omit(cyclistic_df)
count(cyclistic_df_rem_empty)
cyclistic_df_no_dup <- cyclistic_df_rem_empty[!duplicated(cyclistic_df_rem_empty$ride_id), ]
1. How should you organize your data to perform analysis on it?
As the data was separated in different CSV’s, i combined them into a single Data Frame.
I also created new features to support my Analysis.
a. riding_time
cyclistic_clean_df <- cyclistic_df_no_dup
cyclistic_clean_df <- cyclistic_clean_df %>%
mutate(riding_time = as.numeric(ended_at-started_at)/60)
cyclistic_clean_df
b. year_month
cyclistic_clean_df <- cyclistic_clean_df %>%
mutate(year_month = paste(strftime(cyclistic_clean_df$started_at, "%Y"), "-",
strftime(cyclistic_clean_df$started_at, "%m"), "-",
strftime(cyclistic_clean_df$started_at, "%b")))
cyclistic_clean_df
removing data of year ‘2022’ from the data frame, as we are focused on 2021 data
cyclistic_clean_df <- filter(cyclistic_clean_df, year_month != "2022%")
cyclistic_clean_df
c. weekday
cyclistic_clean_df <- cyclistic_clean_df %>%
mutate(weekday = weekdays(cyclistic_clean_df$ended_at))
cyclistic_clean_df